access only to technical personnel that have been qualified by Intel Corporation.

CAUTION 

This equipment has been tested and found to comply with the limits for a 
Class A digital device, pursuant to Part 15 of the FCC Rules. These limits 
are designed to provide reasonable protection against harmful interference
when the equipment is operated in a commercial environment. This
equipment generates, uses, and can radiate radio frequency energy and, 
if not installed and used in accordance with the instruction manual, may 
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expense. 

LIMITED RIGHTS 

The information contained in this document is copyrighted by and shall remain
the property of Intel Corporation. Use, duplication or disclosure by
the U.S. Government is subject to Limited Rights as set forth in subparagraphs
(a)(15) of the Rights in Technical Data and Computer Software
clause at 252.227-7013. Intel Corporation, 2200 Mission College Boulevard,
Santa Clara, CA 95052. For all Federal use or contracts other than
DoD, Limited Rights under FAR 52.227-14, ALT. III shall apply.
Unpublished-rights reserved under the copyright laws of the United States.












Preface 


Organization 

Chapter 1 Level 1 BLAS Performance Evaluation 

Chapter 2 Level 2 BLAS Performance Evaluation 

Chapter 3 Level 3 BLAS Performance Evaluation 


Notational Conventions 


This manual uses the following notational conventions: 

Bold            Identifies command names and switches, system call names,
                reserved words, and other items that must be used exactly
                as shown.

Italic          Identifies variables, filenames, directories, processes,
                user names, and writer annotations in examples. Italic type
                style is also occasionally used to emphasize a word or
                phrase.

Plain-Monospace
                Identifies computer output (prompts and messages), examples,
                and values of variables. Some examples contain annotations
                that describe specific parts of the example. These
                annotations (which are not part of the example code or
                session) appear in italic type style and flush with the
                right margin.

Bold-Italic-Monospace
                Identifies user input (what you enter in response to some
                prompt).

Bold-Monospace
                Identifies the names of keyboard keys (which are also
                enclosed in angle brackets). A dash indicates that the key
                preceding the dash is to be held down while the key
                following the dash is pressed. For example:

                <Break> <s> <Ctrl-Alt-Del>

[ ]             (Brackets) Surround optional items.

...             (Ellipsis dots) Indicate that the preceding item may be
                repeated.

|               (Bar) Separates two or more items of which you may select
                only one.

{ }             (Braces) Surround two or more items of which you must
                select one.

Applicable Documents 

For more information, refer to the Paragon™ System Technical Documentation Guide.

For information about limitations and workarounds, see the Paragon™ System Software Release
Notes for the Paragon™ XP/S System. Release notes are also located in the directory
/usr/share/release_notes on your Paragon system.

Comments and Assistance

Intel Supercomputer Systems Division is eager to hear of your experiences with our products. Please 
call us if you need assistance, have questions, or otherwise want to comment on your Paragon 
system. 


U.S.A./Canada
Intel Corporation
phone: 800-421-2823
Internet: support@ssd.intel.com

Italy
Intel Corporation Italia s.p.a.
Milanofiori Palazzo
20090 Assago
Milano
Italy
1678 77203 (toll free)

France
Intel Corporation
1 Rue Edison-BP303
78054 St. Quentin-en-Yvelines Cedex
France
0590 8602 (toll free)

Japan
Intel Japan K.K.
Supercomputer Systems Division
5-6 Tokodai, Tsukuba City
Ibaraki-Ken 300-26
Japan
0298-47-8904

United Kingdom
Intel Corporation (UK) Ltd.
Supercomputer Systems Division
Pipers Way
Swindon SN3 1RJ
England
0800 212665 (toll free)
(44) 793 491056 (answered in French)
(44) 793 431062 (answered in Italian)
(44) 793 480874 (answered in German)
(44) 793 495108 (answered in English)

Germany
Intel Semiconductor GmbH
Dornacher Strasse 1
8016 Feldkirchen bei Muenchen
Germany
0130 813741 (toll free)

World Headquarters
Intel Corporation
Supercomputer Systems Division
15201 N.W. Greenbrier Parkway
Beaverton, Oregon 97006
U.S.A.
(503) 629-7600 (Monday through Friday, 8 AM to 5 PM Pacific Time)
Fax: (503) 629-9147


If you have comments about our manuals, please fill out and mail the enclosed Comment Card. You 
can also send your comments electronically to the following address: 


techpubs@ssd.intel.com 
Level 1 BLAS Performance Evaluation


Introduction 


The Basic Math Library provides the user with a collection of routines that includes the Level 1, 2,
and 3 BLAS, a variety of FFT routines, tri-pentadiagonal factor and solve routines, and some vector
triads. Many of the routines in this library are written in i860™ assembly language and are highly
tuned for the i860™ XP processor. The i860™ XP processor runs at 50 MHz and is capable of
achieving 75 double-precision Mflops and 100 single-precision Mflops. A more realistic
double-precision performance peak is 50 Mflops, since it is difficult to structure many linear algebra
computations to complete two additions and one multiply every cycle; a ratio of one add to one
multiply is much more common.
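
The peak figures above can be related to the clock rate with simple arithmetic. The following is a minimal sketch, not part of the library: the function names are hypothetical, and the only inputs are the clock rate and Mflops figures quoted in the text plus one measured value (DGEMM at N=512, from Table 3-2).

```python
# Figures quoted in the text for the i860 XP processor.
CLOCK_MHZ = 50.0
PEAK_DP_MFLOPS = 75.0        # stated double-precision peak
PEAK_SP_MFLOPS = 100.0       # stated single-precision peak
REALISTIC_DP_MFLOPS = 50.0   # stated "more realistic" double-precision peak


def flops_per_cycle(mflops, clock_mhz=CLOCK_MHZ):
    """Average floating-point operations completed per clock cycle."""
    return mflops / clock_mhz


def fraction_of_peak(measured_mflops, peak_mflops):
    """Measured rate expressed as a fraction of a chosen peak rate."""
    return measured_mflops / peak_mflops


if __name__ == "__main__":
    for label, rate in [("double-precision peak", PEAK_DP_MFLOPS),
                        ("single-precision peak", PEAK_SP_MFLOPS),
                        ("realistic double-precision peak", REALISTIC_DP_MFLOPS)]:
        print(f"{label}: {flops_per_cycle(rate):.1f} flops per cycle")

    # DGEMM at N=512 is listed at 45.9 MFLOPS in Table 3-2.
    dgemm = 45.9
    print(f"DGEMM vs. 75 Mflops: {fraction_of_peak(dgemm, PEAK_DP_MFLOPS):.1%}")
    print(f"DGEMM vs. 50 Mflops: {fraction_of_peak(dgemm, REALISTIC_DP_MFLOPS):.1%}")
```

Comparing a measured rating against the 50 Mflops figure rather than the 75 Mflops figure gives a fairer picture of how close a double-precision routine is to what the hardware can sustain on typical linear algebra kernels.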

Only BLAS and FFT performance is addressed in this document. 

This document is a guide to enhancing application performance: it reports the expected
performance of a standard set of commonly used subroutines provided with the system
software.

The user should expect only slight variations from the performance levels documented here. The 
results for the Level 2 and 3 BLAS can be duplicated by running the BLAS test suites provided with 
the system acceptance tests (SAT). 


Level 1 BLAS Performance 

The Level 1 BLAS routines perform basic vector-vector operations. For the performance evaluation 
of the Level 1 BLAS, only unit stride is used. The vector lengths (N) are varied from 100 to 1700 
and for each vector length a corresponding MFLOPS (millions of floating-point operations per 
second) rating is calculated. Along with the performance characterization of each routine, the 
routines are also tested for correctness. 
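
The report does not state the operation counts behind its ratings. The following is a minimal sketch of how an MFLOPS rating for a unit-stride Level 1 routine could be computed, assuming the conventional counts of 2N operations for AXPY and DOT and N for SCAL. The pure-Python `axpy` is a stand-in for illustration only, not the library routine, and all names here are hypothetical.

```python
import time

# Conventional flop counts per vector element (an assumption, not from the report).
FLOPS_PER_ELEMENT = {"axpy": 2, "dot": 2, "scal": 1}


def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / seconds / 1.0e6


def axpy(alpha, x, y):
    """Stand-in AXPY: y := alpha*x + y, unit stride, updated in place."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def rate_axpy(n, repeats=50):
    """Time `repeats` calls on vectors of length n and return the MFLOPS rating."""
    x = [1.0] * n
    y = [2.0] * n
    start = time.perf_counter()
    for _ in range(repeats):
        axpy(0.5, x, y)
    elapsed = time.perf_counter() - start
    return mflops(FLOPS_PER_ELEMENT["axpy"] * n * repeats, elapsed)


if __name__ == "__main__":
    # Vector lengths from 100 to 1700, as in the evaluation described above.
    for n in range(100, 1800, 100):
        print(n, round(rate_axpy(n), 2))
```

The rates this prints reflect the Python interpreter, not the i860 XP; only the bookkeeping (operation count divided by elapsed time) carries over.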








Paragon™ Basic Math Library Performance Report 


The four tables that follow display the performance of the real and complex Level 1 BLAS
in both single and double precision.


Table 1-1. Level 1 BLAS: Single-Precision Performance (MFLOPS)

    N    SASUM    SAXPY     SDOT   SDSDOT     SROT    SSCAL
  100      8.6     27.3     38.3     38.0     42.9     25.1
  200      9.0     32.9     52.6     49.3     52.2     30.0
  300      9.1     36.3     60.8     57.2     53.6     33.5
  400       --     39.1     65.3     62.3     56.3     34.1
  500       --     40.4     68.8       --     56.0     35.6
  600       --     41.1     72.5     71.3     57.6     35.6
  700       --     42.2     71.9     73.2     56.7     36.7
  800       --     42.8     73.4     70.3     58.3     36.5
  900      9.3     43.2     75.7     72.2     56.6     37.3
 1000      9.3     43.6     77.4     72.6     57.9     37.1
 1100       --     44.0     78.3       --     56.7     18.8
 1200       --       --       --       --     56.4     18.4
 1300       --       --     81.9       --     55.3     18.6
 1400       --     44.3       --       --       --       --
 1500      9.4     44.3       --     81.2     53.1     18.2
 1600      9.4     44.7       --     83.3       --       --
 1700      9.5       --       --     82.2       --       --

Table 1-3. Level 1 BLAS: Single-Precision Complex Performance (MFLOPS)

    N   SCASUM    CAXPY    CSCAL   CSSCAL    CDOTC    CDOTU
  100      9.2     60.6     56.2     35.8     68.6     68.1
  200       --     77.2     64.1     40.1     86.6     86.5
  300       --     80.3     65.9     42.0     89.4     89.4
  400       --     83.6     67.3     44.3     91.3     91.1
  500      9.6     85.1     68.7     46.1     92.0     92.0
  600      9.6     85.7     69.6     46.6     92.7     92.6
  700      9.7     86.0     70.5     46.3     93.5     93.3
  800      9.7     86.2     70.4     46.6     93.5     93.3
  900      9.7     86.9     70.8     47.1     93.7     93.3
 1000      9.6     87.2     70.7     47.3     94.1     94.5
 1100      9.6     87.3     71.7     46.9       --     94.5
 1200      9.6       --       --     46.9       --     94.0
 1300      9.6       --       --     47.0       --       --
 1400       --       --       --       --       --       --
 1500      9.6       --       --     47.1       --       --
 1600      9.5       --       --     47.1       --       --
 1700       --       --       --     46.1     94.1       --

Table 1-4. Level 1 BLAS: Double-Precision Complex Performance (MFLOPS)

    N   DZASUM    ZAXPY    ZSCAL   ZDSCAL    ZDOTC    ZDOTU
  100      9.2       --     32.2     60.2     44.9     45.4
  200      9.5       --     34.7     67.2     45.5     44.9
  300       --     44.3     35.2     69.2     48.6     48.4
  400      9.6     44.5     35.7     70.1     51.0     50.4
  500      9.6     45.2     35.9     70.8     51.7     52.0
  600       --     45.3     35.8     71.1     52.3     52.6
  700      9.6     45.3     35.5     71.6     53.4     53.2
  800      9.6     45.3     35.9     70.7     52.7     53.4
  900      9.5     44.5     34.7     68.5     52.5     52.2
 1000      9.5     44.4     34.9     64.8     52.5     52.3
 1100      8.9     33.0     31.2     52.8     37.2     37.1
 1200      8.3     33.0     28.0     41.5     37.2     37.2
 1300      7.9     33.1     26.2     35.5     37.3     37.3
 1400      7.6     33.2     24.4     31.6     37.4     37.4
 1500      7.4     33.3     23.1     28.4     37.4     37.4
 1600      7.3       --     23.1     26.7     37.3     37.3
 1700      7.1     33.2     23.1     25.2     37.4     37.4

Level 2 BLAS Performance

The Level 2 BLAS routines perform matrix-vector operations. The routines used to evaluate the
performance of the Level 2 BLAS were adopted from the public domain LAPACK BLAS test
programs. With each test routine, a data file is supplied that contains information such as the test
ratio threshold value, the values of N, K, stride, ALPHA, and BETA, and the name of the routine to be
evaluated.

The test programs were modified so that only unit stride was used and both ALPHA and BETA were
neither zero nor one. The value of N was varied by factors of 2 from 8 to 512, and for each N the
routine to be evaluated was called multiple times with different values of K, UPLO, TRANS, and/or
DIAG. The value of M was set to both MAX(N-N/2-1, 0) and MIN(N+N/2+1, 512). An MFLOPS rating
was calculated for each value of N as an average of all the calls. Along with this performance
characterization, the routines were also tested for correctness.
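
The problem sizes described above can be enumerated directly. The following is a minimal sketch with hypothetical function names; it assumes that N/2 in the two expressions for M is an integer division, as it would be in the Fortran test programs.

```python
def n_values(start=8, stop=512):
    """Values of N, doubling from `start` up to and including `stop`."""
    values = []
    n = start
    while n <= stop:
        values.append(n)
        n *= 2
    return values


def m_values(n, limit=512):
    """The two values of M used for a given N.

    Implements MAX(N-N/2-1, 0) and MIN(N+N/2+1, 512), with N/2 taken as
    integer division (an assumption).
    """
    half = n // 2
    return max(n - half - 1, 0), min(n + half + 1, limit)


if __name__ == "__main__":
    for n in n_values():
        small_m, large_m = m_values(n)
        print(f"N={n:4d}  M={small_m:4d} and M={large_m:4d}")
```

Under these assumptions each N is exercised with one matrix that has fewer rows than N and one that has more, capped at 512.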

The four tables that follow display the performance (in MFLOPS) of both the real and complex
Level 2 BLAS in both single and double precision.
Table 2-1. Level 2 BLAS: Single-Precision Performance (MFLOPS)

    N    SGEMV    SGBMV    SSYMV    SSBMV    SSPMV    STRMV
    8      3.1      3.1      1.8      3.1      1.8      1.7
   16      6.9       --      3.3      5.2      3.4      3.5
   32     16.2      7.4      5.4      7.6       --      6.5
   64       --      8.9       --      8.9       --     14.0
  128       --     11.0       --     11.5     26.7       --
  256     68.0     16.3     43.4     14.7     44.0       --
  512     76.2     20.4     57.6     17.6     60.2     59.9

    N    STBMV    STPMV    STRSV    STBSV    STPSV     SGER
    8      2.2      2.4      1.1      1.1      1.3      3.7
   16      3.7      5.4      2.5      2.0      3.3      7.0
   32      5.0      9.0      4.8      2.5      5.8     15.6
   64      5.8     12.2     10.4      2.9      9.1     22.1
  128      6.3       --     20.7      3.1     18.7     34.4
  256      6.5       --     34.4      3.1     32.5     38.0
  512      6.4       --     50.9      3.1     48.4     41.3

    N     SSYR     SSPR    SSYR2    SSPR2
    8      1.9      1.9      5.1      4.8
   16      3.4      3.3     10.6     10.6
   32      5.9      6.1     16.9     17.0
   64      9.8     10.6     21.3     21.4
  128     17.6     18.4     31.6     30.6
  256     25.8     27.5     50.8     50.6
  512     34.3     34.4     66.5     64.1

Table 2-2. Level 2 BLAS: Double-Precision Performance (MFLOPS)

Table 2-3. Level 2 BLAS: Complex Single-Precision Performance (MFLOPS)

Table 2-4. Level 2 BLAS: Complex Double-Precision Performance (MFLOPS)

Level 3 BLAS Performance

The Level 3 BLAS perform matrix-matrix operations. The routines used to evaluate the performance
of the Level 3 BLAS were adopted from the public domain LAPACK BLAS test programs. With
each test routine, a data file is supplied that contains information such as the test ratio threshold
value, the values of N, ALPHA, and BETA, and the names of the routines to be evaluated.

The test programs were modified so that both ALPHA and BETA were neither zero nor one. The
value of N was varied from 8 to 512, with M and K equal to N. For each N, the routine to be evaluated
was called multiple times with different values of UPLO, SIDE, TRANS, and/or DIAG. An MFLOPS
rating was calculated for each value of N. Along with this performance characterization, the routines
were also tested for correctness.
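
An MFLOPS rating can be turned back into an estimated run time once an operation count is chosen. The following is a minimal sketch with hypothetical function names; it assumes the conventional count of 2*M*N*K operations for a general matrix-matrix multiply, which this report does not state and which ignores the extra work of scaling by ALPHA and BETA.

```python
def gemm_flops(m, n, k):
    """Conventional operation count for C := alpha*A*B + beta*C (assumed)."""
    return 2 * m * n * k


def gemm_seconds(n, rate_mflops):
    """Estimated time for one square (M = N = K) multiply at a given MFLOPS rating."""
    return gemm_flops(n, n, n) / (rate_mflops * 1.0e6)


if __name__ == "__main__":
    # SGEMM ratings for N=8 and N=512, taken from Table 3-1.
    for n, rate in [(8, 1.1), (512, 87.8)]:
        print(f"N={n:3d}: {gemm_flops(n, n, n):>11d} flops, "
              f"about {gemm_seconds(n, rate):.6f} s per call")
```

Because the operation count grows with the cube of N, a routine spends far longer in each call at N=512 than at N=8 even though its rating is much higher there.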

The four tables that follow display the performance (in MFLOPS) of both the real and complex
Level 3 BLAS in both single and double precision.


Table 3-1. Level 3 BLAS: Single-Precision Real Performance (MFLOPS)

    N    SGEMM    SSYMM    STRMM    STRSM    SSYRK   SSYR2K
    8      1.1      2.9      2.1      1.7      2.6      2.3
   16      4.1      5.0      4.3      3.5      5.1      6.3
   32     40.6     22.2     21.9     18.4     16.7     17.0
   64     57.7     33.0     41.6     38.4     37.3     36.3
  128     77.0     56.2     62.0     58.2     58.6     58.0
  256     83.7     65.4     73.2     71.4     73.8     70.8
  512     87.8     72.0     81.2     80.5     81.6     79.1



Table 3-2. Level 3 BLAS: Double-Precision Real Performance (MFLOPS)

    N    DGEMM    DSYMM    DTRSM    DTRMM    DSYRK   DSYR2K
    8       --      3.8      1.6      2.3       --      2.8
   16       --      8.3      3.8      4.9      7.7      7.7
   32       --     19.1       --       --     15.2     19.4
   64       --     29.9     26.0     29.8     31.4     30.4
  128     45.1     37.8     37.0     38.1     39.7     38.2
  256     45.8     41.0     41.3     41.4     42.5     41.2
  512     45.9     43.3     43.7     44.0     44.2     43.4

Table 3-3. Level 3 BLAS: Single-Precision Complex Performance (MFLOPS)

Table 3-4. Level 3 BLAS: Double-Precision Complex Performance (MFLOPS)

    N    ZGEMM    ZHEMM    ZSYMM    ZTRMM    ZTRSM
    8     20.6      4.7      6.9      5.3      4.6
   16     33.3     14.0     13.7     13.7     12.0
   32     45.0     26.0     25.1     29.0     30.3
   64     54.1     38.1     38.3     45.2     44.9
  128     57.5     47.2     46.9     52.4     52.9
  256     59.3     52.6     52.4     54.7     55.7
  512     58.4     55.9     55.9     57.1     58.0

    N    ZHERK    ZSYRK   ZHER2K   ZSYR2K
    8     10.2     12.3      9.2      5.6
   16     19.6     23.7     18.8     25.0
   32     33.4     41.0     29.9     40.2
   64     44.4     52.3     41.5     51.1
  128     51.7     57.3     47.1     55.0
  256     54.6     58.8     51.6     57.2
  512     56.9     58.6     55.5     59.0

FFT Performance

The Basic Math Library contains three routines for performing Fast Fourier Transforms (FFTs) in
place: complex to complex (forward and inverse), real to complex (forward), and complex to real
(inverse).

Each MFLOPS rating is the average of 100 iterations of both the forward and inverse transforms.
Table initialization was excluded from the computation timings.
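
The report does not say which operation count underlies the FFT ratings. The following is a minimal sketch with hypothetical function names; it assumes the widely used nominal count of 5*N*log2(N) operations for a complex transform of length N, so the times it yields are estimates under that assumption only.

```python
import math


def complex_fft_flops(n):
    """Nominal operation count for a length-n complex FFT (assumed convention)."""
    return 5.0 * n * math.log2(n)


def fft_mflops(n, seconds):
    """MFLOPS rating for one length-n complex transform that took `seconds`."""
    return complex_fft_flops(n) / seconds / 1.0e6


def fft_seconds(n, rate_mflops):
    """Estimated time per transform implied by a given MFLOPS rating."""
    return complex_fft_flops(n) / (rate_mflops * 1.0e6)


if __name__ == "__main__":
    # CFFT1D ratings for N=512 and N=1024, taken from Table 4-1.
    for n, rate in [(512, 71.1), (1024, 58.9)]:
        microseconds = fft_seconds(n, rate) * 1.0e6
        print(f"N={n:4d}: about {microseconds:.1f} microseconds per transform")
```

Excluding table initialization from the timing, as the evaluation does, matters most at small N, where setting up the transform tables can take longer than the transform itself.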

The two tables that follow display the performance (in MFLOPS) of the available FFT routines
in both single and double precision.


Table 4-1. FFT: 1D Complex to Complex Performance (MFLOPS)

    N   CFFT1D   ZFFT1D
   32     42.5     29.2
   64     57.9     35.6
  128     64.3     41.1
  256     68.4     43.9
  512     71.1     45.3
 1024     58.9     42.1
 2048     61.1     24.6

Table 4-2. FFT: 1D Real to Complex/Complex to Real Performance (MFLOPS)

    N   SCFFT1D/CSFFT1D   DZFFT1D/ZDFFT1D
   32              18.1              14.4
   64              30.5              20.4
  128              44.1              25.2
  256              53.3              29.0
  512              61.3              30.7
 1024              66.0              32.0
 2048              56.8              27.2






